Skip to content

fix(slurm): detect srun inside interactive salloc via SLURM_STEP_ID#21712

Open
1fanwang wants to merge 1 commit into
Lightning-AI:masterfrom
1fanwang:fix/slurm-srun-detection-in-salloc
Open

fix(slurm): detect srun inside interactive salloc via SLURM_STEP_ID#21712
1fanwang wants to merge 1 commit into
Lightning-AI:masterfrom
1fanwang:fix/slurm-srun-detection-in-salloc

Conversation

@1fanwang
Copy link
Copy Markdown

@1fanwang 1fanwang commented May 12, 2026

What does this PR do?

Fixes #20776.

_is_srun_used() returned False whenever SLURM_JOB_NAME was bash or interactive. Those values are inherited from an outer salloc allocation, but a nested srun python … inside that allocation IS a real srun invocation. The old check gave the wrong answer in that case, triggering a spurious "srun is available but is not used" warning and making SLURMEnvironment.detect() skip a real SLURM job.

The fix keys off SLURM_STEP_ID first — srun sets it, bare salloc does not. So an salloc → srun python … workflow is now correctly classified as srun-used. The pre-existing bash / interactive exclusion stays so salloc alone (no inner srun) still bails out cleanly.

How was this PR tested?

Added two tests in tests/tests_fabric/plugins/environments/test_slurm.py:

  • test_no_srun_warning_inside_salloc_with_srun — sets SLURM_JOB_NAME=interactive and SLURM_STEP_ID=42, asserts _is_srun_used() returns True and no spurious warning fires.
  • Extended test_detect to cover the same shape.

Both fail without the change and pass with it.

Before submitting

  • Was this discussed/agreed via a GitHub issue? (yes — slurm env incorrectly complains about srun with salloc interactive session. #20776)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (no doc change — internal helper)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (added under src/lightning/fabric/CHANGELOG.md > Unreleased > Fixed)

Documentation preview: https://pytorch-lightning--21712.org.readthedocs.build/en/21712/

Inside an interactive `salloc` allocation, `SLURM_JOB_NAME` is `bash` or
`interactive`, so `_is_srun_used()` returned `False` even when the script
was actually launched as `srun python ...`. This emitted a spurious
"srun is available but is not used" warning and made
`SLURMEnvironment.detect()` skip a real SLURM job.

`srun` sets `SLURM_STEP_ID` on the spawned process while `salloc` alone
does not, so checking for that variable distinguishes the two cases.

Fixes Lightning-AI#20776
@github-actions github-actions Bot added the fabric lightning.fabric.Fabric label May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fabric lightning.fabric.Fabric

Projects

None yet

Development

Successfully merging this pull request may close these issues.

slurm env incorrectly complains about srun with salloc interactive session.

1 participant